19.3 MAP Inference and Sparse Coding

An alternative form of inference is to compute the single most likely value of the missing variables, rather than to infer the entire distribution p(h|v) over their possible values:

\[h^* = \arg\max_h p(h|v)\]

This is known as maximum a posteriori (MAP) inference.

It is helpful to think of MAP inference as a procedure that provides a value of q for the ELBO \(L(v, \theta, q)\). In this sense, we can think of MAP inference as approximate inference, because it does not provide the optimal q.

Review of the ELBO:

\[L(v, \theta, q) = E_{h\sim q}[\log p(h, v)] + H(q)\]

Exact inference consists of maximizing this bound with respect to q over an unrestricted family of probability distributions, using an exact optimization algorithm. We can derive MAP inference as a form of approximate inference by restricting the family of distributions q may be drawn from. Specifically, we require q to take on a Dirac distribution:

\[q(h|v) = \delta (h-\mu)\]
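
To make this step explicit, substituting the Dirac distribution into the ELBO gives (a rough sketch; the differential entropy \(H(q)\) of a Dirac distribution diverges, but it does not vary with \(\mu\), so it is treated here as part of the constant):

\[L(v, \theta, q) = E_{h\sim \delta(h-\mu)}[\log p(h, v)] + H(q) = \log p(h=\mu, v) + \text{const}\]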

This means that we can control q entirely via \(\mu\). Dropping terms of L that do not vary with \(\mu\), we are left with the optimization problem:

\[\mu^* = \arg\max_{\mu} \log p(h=\mu, v)\]

which is equivalent to the MAP inference problem:

\[h^* = \arg\max_h p(h|v)\]

The learning procedure is similar to the EM algorithm, alternating between the following two steps (see the sketch after this list):

  • Perform MAP inference to infer \(h^*\)
  • Update \(\theta\) to increase \(\log p(h^*, v)\)
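
A minimal sketch of this loop, assuming a toy linear-Gaussian model with \(p(h) = N(h; 0, I)\) and \(p(v|h) = N(v; Wh, I)\) (an illustrative assumption, not the sparse coding model discussed below); both steps are plain gradient ascent:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 20, 5
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # model parameters theta
V = rng.normal(size=(100, n_visible))                   # training data

def map_inference(v, W, n_steps=50, lr=0.1):
    """MAP step: gradient ascent on log p(h, v) with respect to h."""
    h = np.zeros(W.shape[1])
    for _ in range(n_steps):
        # d/dh [ -0.5 ||h||^2 - 0.5 ||v - W h||^2 ]
        h += lr * (-h + W.T @ (v - W @ h))
    return h

for epoch in range(10):
    for v in V:
        h_star = map_inference(v, W)               # 1) infer h* by MAP inference
        grad_W = np.outer(v - W @ h_star, h_star)  # 2) d/dW log p(h*, v)
        W += 0.01 * grad_W                         #    update theta to increase log p(h*, v)
```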

MAP inference is commonly used in deep learning as both a feature extractor and a learning mechanism. It is primarily used for sparse coding models.

Review of Sparse Coding:

Sparse coding is a linear factor model that imposes a sparsity-inducing prior on its hidden units. A common choice is a factorial Laplace prior, with

\[p(h_i) = \frac{\lambda}{2}e^{-\lambda |h_i|}\]

The visible units are then generated by performing a linear transformation and adding noise:

\[p(v|h) = N(v; Wh+b, \beta^{-1}I)\]
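
As a concrete illustration, a minimal numpy sketch of ancestral sampling from this generative model; the dimensions and the values of \(\lambda\), \(\beta\), W, and b are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 20, 5
lam, beta = 1.0, 10.0                        # Laplace rate and noise precision (illustrative)
W = rng.normal(size=(n_visible, n_hidden))   # weights
b = np.zeros(n_visible)                      # biases

h = rng.laplace(loc=0.0, scale=1.0 / lam, size=n_hidden)  # h_i ~ Laplace(0, 1/lambda)
v = rng.normal(loc=W @ h + b, scale=beta ** -0.5)         # v ~ N(Wh + b, beta^{-1} I)
```
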
  • Challenge: Computing or even representing p(h|v) is difficult. Every pair of hidden units \(h_i\) and \(h_j\) are parents of v, so when v is observed, the graphical model contains an active path connecting \(h_i\) and \(h_j\); all the hidden units therefore participate in one massive clique in p(h|v).
  • Solution: Use MAP inference, and learn the parameters by maximizing the ELBO defined by the Dirac distribution around the MAP estimate of h (see the derivation sketched below).
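
Concretely, the negative log joint probability that MAP inference minimizes splits into an L1 penalty from the Laplace prior plus a squared reconstruction error from the Gaussian likelihood (a rough sketch, dropping additive constants; the objective J below absorbs \(\lambda\), \(\beta\), and the bias b into its weighting and variables):

\[-\log p(h, v) = -\log p(h) - \log p(v|h) = \lambda \sum_i |h_i| + \frac{\beta}{2}\|v - Wh - b\|^2 + \text{const}\]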

If we concatenate all the h vectors in the training set into a matrix H and all the v vectors into a matrix V, then the sparse coding learning process consists of minimizing:

\[J(H, W) = \sum_{i, j}|H_{i, j}| + \sum_{i, j} (V - HW^T)^2_{i, j}\]

We can minimize J by alternating between (a numpy sketch follows this list):

  • minimization with respect to H, a convex problem
  • minimization with respect to W, a convex problem
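
A minimal numpy sketch of this alternating scheme, assuming an ISTA-style (proximal gradient) update for the non-smooth H subproblem and a least-squares solve for the W subproblem; the iteration counts, step-size rule, and data dimensions are illustrative assumptions:

```python
import numpy as np

def soft_threshold(X, t):
    """Proximal operator of t * ||X||_1 (entrywise soft thresholding)."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def sparse_coding(V, n_hidden, n_outer=20, n_inner=50, seed=0):
    """Alternating minimization of J(H, W) = sum |H_ij| + ||V - H W^T||_F^2."""
    rng = np.random.default_rng(seed)
    n_examples, n_visible = V.shape
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    H = np.zeros((n_examples, n_hidden))
    for _ in range(n_outer):
        # H-step: ISTA (proximal gradient) on the convex subproblem in H.
        step = 1.0 / (2.0 * np.linalg.norm(W.T @ W, 2) + 1e-8)
        for _ in range(n_inner):
            grad = -2.0 * (V - H @ W.T) @ W            # gradient of the quadratic term
            H = soft_threshold(H - step * grad, step)  # prox of the L1 term
        # W-step: convex least-squares subproblem in W.
        W = np.linalg.lstsq(H, V, rcond=None)[0].T
    return H, W

# Usage example on random data (for illustration only).
V = np.random.default_rng(1).normal(size=(200, 30))
H, W = sparse_coding(V, n_hidden=10)
J = np.abs(H).sum() + ((V - H @ W.T) ** 2).sum()
print(f"J = {J:.2f}, nonzero fraction of H = {(H != 0).mean():.2f}")
```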